Contents

  • The Challenge of Unstructured Modeling
  • Using Graphs to Describe Model Structure
    • Directed Models
    • Undirected Models
    • The Partition Function
    • Energy-Based Models
    • Separation and D-Separation
    • Converting between Undirected and Directed Graphs
    • Factor Graphs
  • Sampling from Graphical Models (generating samples from the estimated distribution)
  • Advantages of Structured Modeling
  • Learning about Dependencies
  • Inference and Approximate Inference
  • The Deep Learning Approach to Structured Probabilistic Models (IMPORTANT!!!)
    • Example: Restricted Boltzmann Machine

Introduction

  • A structured probabilistic model is a way of describing a probability distribution, using a graph to describe which random variables in the distribution interact with each other directly.
    • Graph: vertices, edges
  • Deep learning practitioners tend to use very different model structures, learning algorithms, and inference procedures.

16.1 The Challenge of Unstructured Modeling

  • Tasks using Probabilistic Models are often more expensive than classification.
    • Multiple Output Values
    • c.f. Classification produces a single output, and ignores many parts of the input
  • Probabilistic Models => a complete understanding of the entire structure of the input
    • Density estimation
      • The base task: return $p(\mathbf{x})$ for an input $\mathbf{x}$
    • Denoising
      • Recover the clean $\mathbf{x}$ from a damaged or noisy observation
    • Missing value imputation
      • Given some observed entries of $\mathbf{x}$, predict (or give a distribution over) the unobserved entries
    • Sampling
      • Generates new samples from the distribution
  • Requirements for the models
    • Memory for representation
      • c.f. A naive (unstructured) distribution over n variables with k states each needs $O(k^n)$ table entries (see the sketch after this list)
    • Statistical efficiency
      • As the number of model parameters grows, the amount of training data needed to estimate them reliably grows too
    • The cost of inference
    • The cost of sampling
  • Structured probabilistic models provide a formal framework for modeling only direct interactions between random variables.
    • Fewer parameters
    • Reduce computational cost
      • storing the model, inference, sampling
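A quick back-of-the-envelope comparison of table sizes, as a minimal sketch; the variable count, state count, and maximum number of parents below are hypothetical:

```python
# Entries in a full joint table over n discrete variables with k states each,
# versus a directed model in which each variable has at most m parents.
def full_table_entries(n, k):
    return k ** n              # O(k^n): one entry per joint configuration

def structured_entries(n, k, m):
    return n * k ** (m + 1)    # each conditional table has at most k^(m+1) entries

n, k, m = 32, 2, 3             # hypothetical: 32 binary variables, <= 3 parents each
print(full_table_entries(n, k))      # 4294967296
print(structured_entries(n, k, m))   # 512
```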

16.2 Using Graphs to Describe Model Structure

16.2.1 Directed Models

  • Known as Belief Network, Bayesian Network
  • Directed Edge means conditional distribution
    • e.g. relay run $p(t_0, t_1, t_2) = p(t_0)p(t_1|t_0)p(t_2|t_1)$
  • General Case

    • $p(\mathbf{x}) = \prod_i p(x_i| Pa_\mathcal{G}(x_i))$
      • $Pa_\mathcal{G}(x_i)$ denotes the 'pa'rent nodes of $x_i$ in $\mathcal{G}$.
  • n discrete variables each having k values

    • Unstructured models: $O(k^n)$ parameters
    • Structured models: $O(k^m)$ parameters (m is the maximum number of variables appearing in a single conditional distribution)
      • Few parents in the graph -> few parameters -> efficient (see the sketch after this list)
  • Graph encoding
    • "Which variables are conditionally independent of each other."

16.2.2 Undirected Models

  • Known as Markov Random Fields (MRFs), Markov Networks
  • Use when the interactions seem to have no intrinsic direction or to operate in both directions.

    • e.g. Health of your roommate, you, your coworker
  • General Case

    • Unnormalized probability distribution for undirected models (a small example follows this list)
      • $\tilde{p}(\mathbf{x})=\prod_{C\in \mathcal{G}} \phi(C)$
        • C is a clique: a subset of nodes that are all connected to each other
        • $\phi(C)$ is the clique potential (a non-negative factor over the variables in C)
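A minimal sketch of the unnormalized distribution for a binary chain a - b - c with two pairwise cliques; the potential tables are made up:

```python
import numpy as np

# Hypothetical clique potentials for binary variables a - b - c (cliques {a,b}, {b,c}).
phi_ab = np.array([[2.0, 1.0],
                   [1.0, 3.0]])    # phi(a, b): larger when a == b
phi_bc = np.array([[4.0, 1.0],
                   [1.0, 4.0]])    # phi(b, c): larger when b == c

def p_tilde(a, b, c):
    """Unnormalized probability: product of the clique potentials."""
    return phi_ab[a, b] * phi_bc[b, c]

print(p_tilde(1, 1, 1))   # 3.0 * 4.0 = 12.0
print(p_tilde(0, 1, 0))   # 1.0 * 1.0 = 1.0
```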

16.2.3 The Partition Function

  • Normalized probability distribution for undirected models
    • $p(\mathbf{x}) = \frac{1}{Z}\tilde{p}(\mathbf x)$
      • $Z = \int \tilde{p}(\mathbf x) d\mathbf{x}$ (a sum for discrete variables; a brute-force sketch follows this list)
  • $Z$ is known as the partition function, a term borrowed from statistical physics.

    • often intractable to compute.
    • $\phi$ must be conducive to computing $Z$ efficiently.
    • We need Approximations (CH 18)
  • Difference btw Directed and Undirected Models

    Directed models are defined directly in terms of probability distributions from the start, while undirected models are defined more loosely by $\phi$ functions that are then converted into probability distributions.
  • The domain of each of the variables has a dramatic effect on the kind of probability distribution that a given set of $\phi$ functions corresponds to.
  • Often, it is possible to leverage the effect of a carefully chosen domain of a variable in order to obtain complicated behavior from a relatively simple set of $\phi$ functions.
    • e.g. whether $Z$ diverges can depend on whether $\mathbf{x} \in \mathbb{R}^n$ or $\mathbf{x} \in \{ 0,1 \}^n$
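For a small discrete model, $Z$ can be computed by brute force, as in the sketch below (reusing the chain potentials from 16.2.2); for continuous or high-dimensional $\mathbf{x}$ this sum or integral is generally intractable:

```python
import itertools
import numpy as np

phi_ab = np.array([[2.0, 1.0], [1.0, 3.0]])
phi_bc = np.array([[4.0, 1.0], [1.0, 4.0]])

def p_tilde(a, b, c):
    return phi_ab[a, b] * phi_bc[b, c]

# Z sums p_tilde over all 2^3 configurations; the normalized p then sums to 1.
Z = sum(p_tilde(*x) for x in itertools.product([0, 1], repeat=3))
p = lambda a, b, c: p_tilde(a, b, c) / Z

print(Z)                                                        # 35.0
print(sum(p(*x) for x in itertools.product([0, 1], repeat=3)))  # 1.0
```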

16.2.4 Energy-Based Models (EBM)

  • $\tilde{p}(\mathbf{x}) = \exp(-E(\mathbf{x}))$
    • Ensures $\tilde{p}(\mathbf{x}) > 0$ for every $\mathbf{x}$
    • $E(\mathbf{x})$ is the energy function
    • $\tilde{p}(\mathbf{x})$ in an EBM is an example of a Boltzmann distribution.
      • Many energy-based models are called Boltzmann machines.
  • Different cliques in the undirected graph correspond to the different terms of the energy function.
    • An EBM is just a special kind of Markov network
    • One can view an EBM as a product of "experts" (each term in the energy acts as one expert)
  • Terms such as "partition function" and "energy" were originally developed by statistical physicists.
  • Many algorithms never need $p_\text{model}(\mathbf{x})$ itself, only $\log \tilde{p}_\text{model}(\mathbf{x})$
    • Free energy, for models with latent variables $\mathbf{h}$ (a small sketch follows this list)
      • $\mathcal{F}(\mathbf{x}) = -\log \sum_\mathbf{h} \exp(-E(\mathbf{x}, \mathbf{h}))$
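A minimal sketch of the energy-based view: pick an energy function, obtain $\tilde{p} = \exp(-E)$, and compute the free energy by enumerating a small binary $\mathbf{h}$. The bilinear energy used here is made up purely for illustration.

```python
import itertools
import numpy as np

W = np.array([[1.0, -2.0],
              [0.5,  1.5]])              # hypothetical interaction weights

def energy(x, h):
    """Lower energy = higher unnormalized probability."""
    return -x @ W @ h                    # one possible bilinear energy term

def p_tilde(x, h):
    return np.exp(-energy(x, h))         # exp(-E) is positive for every (x, h)

def free_energy(x, n_h=2):
    """F(x) = -log sum_h exp(-E(x, h)), summing over all binary h."""
    s = sum(p_tilde(x, np.array(h)) for h in itertools.product([0, 1], repeat=n_h))
    return -np.log(s)

x = np.array([1.0, 0.0])
print(free_energy(x))                    # marginalizes over 4 hidden configurations
```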

16.2.5 Separation and D-Separation

  • Which variables indirectly interact?
  • Separation (for Undirected Models)
    • Conditional independences implied by the graph structure
    • Two sets of variables are separated given the observed variables if (1) no path exists between them, or (2) every path between them contains an observed variable
    • Active/Inactive paths
      • A path involving only unobserved variables is active => the variables it connects are not separated (may be dependent)
      • A path containing an observed variable is inactive => if all paths between two variables are inactive, they are separated (conditionally independent)
  • d-Separation (for Directed Models)

    • "d-" stands for "dependence"
    • Same active/inactive idea, but the rules for when a path is active are more involved in directed graphs; e.g. a path through a common child (collider) is inactive unless that child or one of its descendants is observed
  • Context-specific Independences

    • Cannot be represented with existing graphical notation
    • Independences that hold only for certain values of some variables in the network
    • e.g. for binary a, b, c (a small numeric check follows this list):
      1. a = 0 => b and c are independent
      2. a = 1 => b = c (deterministically)
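A small numeric check of the example above, with hypothetical conditional tables for binary a, b, c: when a = 0, c ignores b entirely; when a = 1, c is forced to equal b.

```python
import numpy as np

p_b = np.array([0.4, 0.6])               # p(b), made up

def p_c_given_ab(c, a, b):
    """p(c | a, b): independent of b when a = 0, deterministic copy of b when a = 1."""
    if a == 0:
        return [0.7, 0.3][c]
    return 1.0 if c == b else 0.0

# Given a = 0: p(b, c | a=0) factorizes as p(b) * p(c), so b and c are independent.
for b in (0, 1):
    for c in (0, 1):
        assert np.isclose(p_b[b] * p_c_given_ab(c, 0, b), p_b[b] * [0.7, 0.3][c])

# Given a = 1: c is a deterministic function of b, so b and c are fully dependent.
print(p_c_given_ab(1, 1, 1), p_c_given_ab(1, 1, 0))   # 1.0 0.0
```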

16.2.6 Converting between Undirected and Directed Graphs

  • We often refer to a specific machine learning model as being undirected or directed.
    • RBM: undirected
    • Sparse coding: Directed
  • But every probability distribution can be represented by either a directed or an undirected graph
    • No probabilistic model is inherently directed or undirected.
    • Some models are most easily described using a directed (or undirected) graph
  • We should "Choose" which language to use for each task.
    • Which approach can capture the most independences.
    • Which approach uses the fewest edges
  • We may sometimes switch between different modeling languages for a single probability distribution
    • Sampling: Directed (Ch16.3)
    • Approximate Inference: Undirected (Ch19)
  • Conversion between Directed and Undirected
    • Converting Directed into Undirected (a moralization sketch follows this list)
      • An undirected model cannot represent the independences implied by an "immorality" in a directed model
        • Immorality
          • Two parents that are not connected to each other share a common child in the directed graph
        • Moralized graph
          • The undirected graph that keeps as many of the directed graph's independences as possible
          • Add an undirected edge between every pair of "immoral" parents, then drop all edge directions
    • Converting Undirected into Directed
      • A directed model cannot represent the independences of an undirected model that contains a loop of length greater than three, unless that loop has a chord
      • Chord
        • An edge between any two non-consecutive variables in the loop
      • Convert to a chordal (triangulated) graph
        • Add chords until every loop of length four or more has one
        • i.e. break long loops into triangular loops
        • Then assign directions to the edges (e.g. following an ordering of the nodes) so that no directed cycle is created
      • WHY?
        • Compare how "separation" in undirected graphs and "d-separation" in directed graphs encode independences: each language can express independence structures the other cannot, so some graphs cannot be converted without losing independences
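A minimal sketch of the directed-to-undirected step (moralization), representing the directed graph as a plain parents dictionary: every pair of unconnected parents that share a child is "married", then edge directions are dropped.

```python
from itertools import combinations

def moralize(parents):
    """Convert a directed graph {node: set of parents} into an undirected edge set."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:                          # keep each original edge, undirected
            edges.add(frozenset((p, child)))
        for p1, p2 in combinations(pa, 2):    # "marry" all parents of the same child
            edges.add(frozenset((p1, p2)))
    return edges

# Immorality: a -> c <- b with a and b not connected in the directed graph.
print(moralize({"a": set(), "b": set(), "c": {"a", "b"}}))
# -> edges {a,c}, {b,c}, plus the added marrying edge {a,b}
```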

16.2.7 Factor Graphs

  • Another way of drawing undirected models
    • Resolves an ambiguity in the standard graphical representation: a clique alone does not say whether it is described by one factor over all of its variables or by several smaller factors
    • Factor graphs explicitly represent the scope of each $\phi$ (a small illustration follows)
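A tiny illustration of the ambiguity factor graphs remove: the same undirected triangle over a, b, c could stand for one factor over all three variables or for three pairwise factors. Below, each factorization is simply written out as a list of factor scopes.

```python
# Two different factor graphs that share the same undirected graph (a 3-clique).
one_big_factor   = [("phi_abc", ("a", "b", "c"))]            # a single 3-way factor
pairwise_factors = [("phi_ab", ("a", "b")),                  # three 2-way factors
                    ("phi_bc", ("b", "c")),
                    ("phi_ac", ("a", "c"))]

def table_entries(factors, k=2):
    """Potential-table entries needed when every variable has k states."""
    return sum(k ** len(scope) for _, scope in factors)

print(table_entries(one_big_factor))    # 2^3 = 8
print(table_entries(pairwise_factors))  # 3 * 2^2 = 12, but only pairwise interactions
```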

16.3 Sampling from Graphical Models

  • Ancestral Sampling
    • Use "ONLY" Directed Graphical Models
    • Sort variables $\mathbf{x}_i$ in the graph into a topological ordering.
    • Sample in this order
    • e.g. If $\mathbf{x}_1$ -> $\mathbf{x}_2$ -> $\mathbf{x}_3$ ... -> $\mathbf{x}_n$
      • Sample $\mathbf{x}_1 \sim P(\mathbf{x}_1)$
      • Sample $\mathbf{x}_2 \sim P(\mathbf{x}_2 | \mathbf{x}_1)$
      • ...
      • Sample $\mathbf{x}_n \sim P(\mathbf{x}_n | \mathbf{x}_{n-1})$
  • Ancestral sampling for undirected models
    • Preprocessing: convert the undirected model into a directed one
    • The conversion itself may require solving intractable inference problems
    • Or the converted model may have so many edges that it becomes intractable
  • Gibbs sampling for undirected models (Ch17; a sketch follows this list)
    • Uses the separation properties of the graph
    • Iteratively visit each variable $\mathbf{x}_i$
    • Resample it conditioned only on its neighbors (a "local" update)
    • A single pass is not a fair sample $\mathbf{x} \sim P(\mathbf{x})$
    • Repeat!!! -> asymptotically converges to sampling from the correct distribution
    • But when do we stop?
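A minimal sketch of ancestral sampling for the chain example above (the conditional tables are made up; compare with the relay-run sketch in 16.2.1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical CPTs for the chain x1 -> x2 -> x3 over 3-state variables.
p_x1 = np.array([0.3, 0.5, 0.2])
p_x2_given_x1 = np.array([[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]])
p_x3_given_x2 = np.array([[0.8, 0.1, 0.1], [0.3, 0.5, 0.2], [0.1, 0.2, 0.7]])

def ancestral_sample():
    """Visit variables in topological order, sampling each given its parents."""
    x1 = rng.choice(3, p=p_x1)
    x2 = rng.choice(3, p=p_x2_given_x1[x1])
    x3 = rng.choice(3, p=p_x3_given_x2[x2])
    return x1, x2, x3

print(ancestral_sample())
```

And a minimal Gibbs sampling sketch for the undirected binary chain a - b - c from 16.2.2, resampling one variable at a time from its conditional given the rest (only the neighbors actually matter):

```python
import numpy as np

rng = np.random.default_rng(0)
phi_ab = np.array([[2.0, 1.0], [1.0, 3.0]])
phi_bc = np.array([[4.0, 1.0], [1.0, 4.0]])

def p_tilde(a, b, c):
    return phi_ab[a, b] * phi_bc[b, c]

def gibbs_sweep(state):
    """One sweep: resample each variable from p(x_i | all other variables)."""
    s = list(state)
    for i in range(3):
        weights = []
        for v in (0, 1):
            s[i] = v
            weights.append(p_tilde(*s))
        s[i] = rng.choice(2, p=np.array(weights) / sum(weights))
    return tuple(s)

# Burn in, then collect samples; their empirical distribution approaches p(x),
# but knowing how many sweeps are "enough" is the hard part.
state = (0, 0, 0)
for _ in range(100):
    state = gibbs_sweep(state)
samples = []
for _ in range(5000):
    state = gibbs_sweep(state)
    samples.append(state)
```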

16.4 Advantages of Structured Modeling

  • Reduce the cost of representing probability distributions as well as learning and inference
  • Sampling
  • Easier to develop and debug
    • Separates the representation of knowledge from the learning of knowledge and from inference given existing knowledge

16.5 Learning about Dependencies

  • Use Latent(or Hidden) Variables $\mathbf{h}$
  • Visible variables $\mathbf{v}$ connected directly to each other VS $\mathbf{v}$ connected only indirectly through latent variables $\mathbf{h}$
    • Many edges VS few edges
    • Large cliques VS small cliques
    • High computational cost VS low computational cost
  • "Structure Learning"
    • Greedy search (a toy score-based sketch follows this list)
      • Propose a structure (with a small number of edges added or removed)
      • Train it, give it a score
      • Reward accuracy and penalize model complexity
      • Repeat
    • Using $\mathbf{h}$ avoids discrete search and multiple rounds of training (effective)
      • A fixed structure over $\mathbf{v}$ and $\mathbf{h}$ is enough!!!
  • $\mathbf{h}$ also provides an alternative representation of $\mathbf{v}$
    • Richer descriptions of the input
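A toy, score-based sketch of the greedy search idea (not the book's algorithm): fix the variable ordering x0 < x1 < x2, and greedily add a parent edge whenever it improves a BIC-style score (maximum-likelihood log-likelihood minus a complexity penalty) on some synthetic binary data.

```python
from itertools import product
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary data: x1 mostly copies x0, x2 is independent noise.
n = 2000
x0 = rng.integers(0, 2, n)
x1 = (x0 ^ (rng.random(n) < 0.1)).astype(int)
x2 = rng.integers(0, 2, n)
data = np.stack([x0, x1, x2], axis=1)

def score(child, parents):
    """Log-likelihood of the ML estimate of p(child | parents) minus a BIC penalty."""
    ll, n_params = 0.0, 0
    cols = data[:, parents] if parents else np.zeros((n, 0), dtype=int)
    for cfg in product([0, 1], repeat=len(parents)):
        mask = np.all(cols == cfg, axis=1)
        counts = np.bincount(data[mask, child], minlength=2)
        n_params += 1
        if counts.sum() == 0:
            continue
        probs = counts / counts.sum()
        ll += sum(c * np.log(q) for c, q in zip(counts, probs) if c > 0)
    return ll - 0.5 * np.log(n) * n_params

# Greedy search: repeatedly add the edge (earlier var -> later var) that helps the score.
parents = {0: [], 1: [], 2: []}
improved = True
while improved:
    improved = False
    for child in (1, 2):
        for cand in range(child):
            if cand in parents[child]:
                continue
            if score(child, sorted(parents[child] + [cand])) > score(child, parents[child]):
                parents[child] = sorted(parents[child] + [cand])
                improved = True
print(parents)   # typically recovers the edge x0 -> x1 and leaves x2 parentless
```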

16.6 Inference and Approximate Inference

  • Inference Problem
    • Predict the value of some variables given other variables
    • Predict the probability distribution over some variables given the value of other variables
  • Unfortunately, inference problems are...
    • Intractable in general
      • Using a structured graphical model is not, by itself, enough to allow efficient exact inference
    • Computing the marginal probability of a general graphical model is NP-hard (very hard to solve)
  • Use an approximate distribution $q(\mathbf{h}|\mathbf{v}) \approx p(\mathbf{h}|\mathbf{v})$ (Ch19)

16.7 The Deep Learning Approach to Structured Probabilistic Models

  • Deep Graphical Models VS Traditional Graphical Models
  • Different design decisions for Deep Learning
    • Different algorithms
    • Different models
  • The depth of a model in Deep Graphical Models
    • The depth of a latent variable $\mathbf{h}$ is...
      • The length of the shortest path from $\mathbf{h}$ to an observed variable
  • Deep Learning models...
    • Use Distributed Representations
    • Typically have more latent variables than observed variables.
    • Focus on indirect effects
      • Capture nonlinear interactions via indirect connections that pass through multiple latent variables
  • Traditional Graphical Models

    • Focus on direct effects
      • Capture nonlinear interactions using higher-order terms and structure learning between variables
      • Use only a few latent variables
  • Design of Latent variables in Deep Graphical Models

    • The latent variables do not have any specific semantics assigned in advance
    • Usually not very easy for a human to interpret
    • c.f. In the traditional models...
      • Latent variables are designed with some specific semantics in mind by a human
      • Less able to scale to complex problems
      • Not reusable
  • Connectivity typically used in Deep Graphical Models

    • Have large groups of units that are all connected to other groups of units.
      • Interactions between two groups may be described by a single matrix.
      • c.f. In the traditional models...
        • Very few connections, and the choice of connections for each variable may be individually designed
  • The training algorithm is "free" to invent the concepts it needs to model a particular dataset.

    • c.f. In the traditional models...
      • The choice of inference algorithm is "tightly linked" with the design of the model structure.
  • Traditional approaches typically aim at "exact inference".

    • Or use "loopy belief propagation" as an approximate inference algorithm (Murphy Ch20)
    • c.f. In deep graphical models...
      • Gibbs sampling or variational inference algorithms
  • Both of these tools (exact inference and loopy belief propagation) work well only with very sparsely connected graphs.

    • When the graphs are not sparse enough
      • exact inference and loopy belief propagation are no longer the relevant tools
  • Deep graphical models from a computational point of view

    • The very large number of latent variables makes efficient numerical code essential
      • => interactions between groups of units are implemented with efficient matrix product operations
      • => or with sparsely connected generalizations of them
        • block-diagonal matrix products or convolutions
      • c.f. Traditional models typically have irregular, hand-designed connectivity that does not map onto one big matrix operation
  • Trend: the deep learning approach is often to...

    • Figure out the minimum amount of information we absolutely need
    • Figure out how to get a reasonable approximation of that information as quickly as possible
    • c.f. The traditional approach is often to...
      • Simplify the model until the required quantities can be computed exactly
    • Increase the power of the model until it is barely possible to train or use
      • We still train the model even when...
        • Marginal distributions cannot be computed
          • as long as we can draw approximate samples
        • The objective function is intractable
          • as long as we can estimate its gradient

16.7.1 Example: The Restricted Boltzmann Machine

  • Units are divided into 2 groups: visible $\mathbf{v}$ and hidden $\mathbf{h}$
    • No direct interactions between any pair of visible units or any pair of hidden units ("restricted")
  • Commonly used as a building block in deep learning
  • Energy function (introduced in Ch20.2) $$\tilde{p}(\mathbf{v},\mathbf{h}) = \exp(-E(\mathbf{v},\mathbf{h}))$$ $$p(\mathbf{v},\mathbf{h}) = \frac{1}{Z} \exp(-E(\mathbf{v},\mathbf{h}))$$ $$E(\mathbf{v},\mathbf{h})= -\mathbf{b}^{T} \mathbf{v}-\mathbf{c}^{T} \mathbf{h} - \mathbf{v}^{T}\mathbf{W}\mathbf{h}$$
    • Properties (a NumPy sketch follows) $$p(\mathbf{h}|\mathbf{v})=\prod_i p(\mathrm{h}_i| \mathbf{v})$$ $$p(\mathbf{v}|\mathbf{h})=\prod_i p(\mathrm{v}_i| \mathbf{h})$$ $$P(\mathrm{h}_i = 1| \mathbf{v}) = \sigma(\mathbf{v}^{T} \mathbf{W}_{:,i} + \mathrm{c}_{i})$$ $$P(\mathrm{h}_i = 0| \mathbf{v})= 1 - \sigma(\mathbf{v}^{T} \mathbf{W}_{:,i} + \mathrm{c}_{i})$$ $$\frac{\partial}{\partial \mathbf{W}_{i,j} }E(\mathbf{v}, \mathbf{h}) = -\mathrm{v}_i \mathrm{h}_j$$
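A minimal NumPy sketch of the RBM energy and the block Gibbs sampling implied by the factorial conditionals above (random, untrained weights; the layer sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

n_v, n_h = 6, 4
W = rng.normal(0.0, 0.1, size=(n_v, n_h))   # visible-hidden interaction weights
b = np.zeros(n_v)                           # visible biases
c = np.zeros(n_h)                           # hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def energy(v, h):
    """E(v, h) = -b^T v - c^T h - v^T W h."""
    return -b @ v - c @ h - v @ W @ h

def sample_h_given_v(v):
    """p(h_i = 1 | v) = sigma(v^T W[:, i] + c_i); the h_i are conditionally independent."""
    return (rng.random(n_h) < sigmoid(v @ W + c)).astype(float)

def sample_v_given_h(h):
    """p(v_j = 1 | h) = sigma(W[j, :] h + b_j); the v_j are conditionally independent."""
    return (rng.random(n_v) < sigmoid(W @ h + b)).astype(float)

# Block Gibbs sampling: alternate between resampling all of h and all of v at once.
v = rng.integers(0, 2, n_v).astype(float)
for _ in range(1000):
    h = sample_h_given_v(v)
    v = sample_v_given_h(h)
print(v, h, energy(v, h))
```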
